Goto

Collaborating Authors

 linear regression model








Hunting for Discriminatory Proxies in Linear Regression Models

Neural Information Processing Systems

A machine learning model may exhibit discrimination when used to make decisions involving people. One potential cause for such outcomes is that the model uses a statistical proxy for a protected demographic attribute. In this paper we formulate a definition of proxy use for the setting of linear regression and present algorithms for detecting proxies. Our definition follows recent work on proxies in classification models, and characterizes a model's constituent behavior that: 1) correlates closely with a protected random variable, and 2) is causally influential in the overall behavior of the model. We show that proxies in linear regression models can be efficiently identified by solving a second-order cone program, and further extend this result to account for situations where the use of a certain input variable is justified as a ``business necessity''. Finally, we present empirical results on two law enforcement datasets that exhibit varying degrees of racial disparity in prediction outcomes, demonstrating that proxies shed useful light on the causes of discriminatory behavior in models.


Hunting for Discriminatory Proxies in Linear Regression Models

Samuel Yeom, Anupam Datta, Matt Fredrikson

Neural Information Processing Systems

A machine learning model may exhibit discrimination when used to make decisions involving people. One potential cause for such outcomes is that the model uses a statistical proxy for a protected demographic attribute. In this paper we formulate a definition of proxy use for the setting of linear regression and present algorithms for detecting proxies. Our definition follows recent work on proxies in classification models, and characterizes a model's constituent behavior that: 1) correlates closely with a protected random variable, and 2) is causally influential in the overall behavior of the model. We show that proxies in linear regression models can be efficiently identified by solving a second-order cone program, and further extend this result to account for situations where the use of a certain input variable is justified as a "business necessity". Finally, we present empirical results on two law enforcement datasets that exhibit varying degrees of racial disparity in prediction outcomes, demonstrating that proxies shed useful light on the causes of discriminatory behavior in models.


EM Approaches to Nonparametric Estimation for Mixture of Linear Regressions

Welbaum, Andrew, Qiao, Wanli

arXiv.org Machine Learning

In a mixture of linear regression model, the regression coefficients are treated as random vectors that may follow either a continuous or discrete distribution. We propose two Expectation-Maximization (EM) algorithms to estimate this prior distribution. The first algorithm solves a kernelized version of the nonparametric maximum likelihood estimation (NPMLE). This method not only recovers continuous prior distributions but also accurately estimates the number of clusters when the prior is discrete. The second algorithm, designed to approximate the NPMLE, targets prior distributions with a density. It also performs well for discrete priors when combined with a post-processing step. We study the convergence properties of both algorithms and demonstrate their effectiveness through simulations and applications to real datasets.


Supplementary material 1 Dataset documentation

Neural Information Processing Systems

In this section, we follow the Datasheets for Datasets framework Gebru et al., 2020 to document the Who created the dataset (e.g., which team, research group) and on behalf of which Who funded the creation of the dataset? This work is funded by Digital Futures in the project EO-AI4GlobalChange. What do the instances that comprise the dataset represent (e.g., documents, photos, Each instance is one image consisting of 23 channels. How many instances are there in total (of each type, if appropriate)? There are 13 607 images in total.